Load and Visualize IBM Debater® Thematic Clustering of Sentences

Using the IBM Debater® Thematic Clustering of Sentences dataset, you will explore the data, then use it to create a model that dynamically groups sentences by their main topics and themes. This could be used, for example, in an application that collects customer feedback and automatically organizes the comments.

In this first notebook, you will load, explore, clean and visualize the data. You will then save the cleaned dataset to the Watson Studio project as a data asset to be loaded in Part 2 - Model Development to evaluate a K-Means clustering model.

The dataset contains 692 articles from Wikipedia, where the number of sections (clusters) in each article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614.

Table of Contents

0. Prerequisites
1. Load Data
2. Preprocess Data
3. Data Visualization
4. Save the Cleaned Data

0. Prerequisites

Before you run this notebook, complete the following steps:

Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources and connections, and is used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

Watch the ws-project.mov video included with this project for a walkthrough of inserting a project token.

Import required modules
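A sketch of the imports used in the rest of this notebook; pandas, NumPy, and Matplotlib are assumptions based on the loading, preprocessing, and plotting steps described below:

```python
# Assumed imports for data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```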

1. Load Data

This notebook uses one file from the IBM Debater® Thematic Clustering of Sentences dataset, named dataset.csv. The code below sets the path to the data and loads the dataset, which has already been imported into the Watson Studio project as a data asset.

dataset.csv:

This file contains 692 articles from Wikipedia, where the number of sections (clusters) in each article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614.

Each row in the dataset is a Sentence, which belongs to a SectionTitle, and each SectionTitle belongs to an Article Title. The Article Link column gives the original source of the sentence.
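A minimal loading sketch, assuming the asset is named dataset.csv and using the project object created by the token cell above; project.get_file returns a file-like object that pandas can read directly:

```python
# Read dataset.csv from the project's data assets into a DataFrame
df = pd.read_csv(project.get_file('dataset.csv'))
df.head()
```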

2. Preprocess Data

In order for this data to be used to evaluate a clustering model, clusters need to be assigned. According to the readme file included in the original dataset zip, each SectionTitle defines a cluster; that is, every sentence from the same section belongs to the same cluster. Thus, you can combine the Article Title and SectionTitle to get a unique group.

Two columns are added to the dataset to make the clusters easier to work with, giving each cluster a unique label:
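A sketch of how the two columns might be built; the column name label is an assumption, while label_id is the target named later in this notebook. pd.factorize assigns each distinct label a unique integer ID:

```python
# Combine article and section titles into one cluster label per sentence
df['label'] = df['Article Title'] + ' - ' + df['SectionTitle']
# Give each distinct label a unique integer ID
df['label_id'] = pd.factorize(df['label'])[0]
```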

Create a dictionary mapping the label ID to the label name.
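One way to build that dictionary, assuming the label and label_id columns created above:

```python
# Map each label ID back to its human-readable cluster name
id_to_label = df.drop_duplicates('label_id').set_index('label_id')['label'].to_dict()
```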

Looking at the number of sentences that correspond to each cluster (label), you can see that one cluster has far more sentences than the rest.
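Counting sentences per cluster makes the outlier visible; a sketch:

```python
# Number of sentences in each cluster, largest first
counts = df['label_id'].value_counts()
print(counts.head())
```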

Remove this cluster from the dataset so that the groups used to test the model in the second notebook stay comparable in size. One very large group is unlikely to be an accurate representation of real data.
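A sketch of the removal, assuming the dominant cluster is simply the one with the most sentences:

```python
# Drop all rows belonging to the single largest cluster
largest = counts.idxmax()
df = df[df['label_id'] != largest].reset_index(drop=True)
```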

Next, set the features to be Sentence, which contains all the text data of interest. You will predict the label_id with the model in the second notebook. Below you can see that there are 5554 clusters (after removing one), with about 8 sentences per cluster on average.
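A sketch of the feature/target split and the summary statistics quoted above:

```python
# Features are the raw sentences; label_id is the cluster the model should recover
features = df['Sentence']
target = df['label_id']
print(target.nunique(), 'clusters,',
      round(len(df) / target.nunique(), 1), 'sentences per cluster on average')
```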

To test a model, break this dataset into smaller datasets; in the real world, you likely would not work with 5000 unique clusters at once. Split the data so that each set has about 5 clusters: randomly take 5000 of the 5554 clusters, then split them into 1000 sets. This leaves 1000 sets to test on (list_of_groups).
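A sketch of the splitting step; the name list_of_groups comes from the text, while the sampling details (fixed seed, np.split) are assumptions:

```python
# Randomly pick 5000 cluster IDs, then split them into 1000 sets of 5
rng = np.random.default_rng(42)
sampled_ids = rng.choice(target.unique(), size=5000, replace=False)
list_of_groups = np.split(sampled_ids, 1000)  # 1000 arrays of 5 cluster IDs each
```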

3. Data Visualization

Each theme (cluster) has about 8 sentences on average.

You want a relatively uniform distribution of themes, since each set should contain clusters of roughly equal size. In the histogram shown below, the distribution appears uniform.
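A plotting sketch for checking that uniformity, using the modules imported earlier:

```python
# Histogram of cluster sizes (sentences per theme)
df['label_id'].value_counts().hist(bins=20)
plt.xlabel('Sentences per theme')
plt.ylabel('Number of themes')
plt.show()
```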

Next, look at how many words are in each theme (label). On average, themes are 4 words long, and the longest theme is 22 words. Fifty percent of themes are four or fewer words.
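A sketch of the word-count summary for theme labels, assuming the label column built earlier:

```python
# Number of words in each unique theme label
theme_word_counts = df['label'].drop_duplicates().str.split().str.len()
print(theme_word_counts.describe())
```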

On average, the sentences have about 21 words. To test the model in the second notebook, you would not want only very short or very long sentences, since those are unlikely in real comments; about 21 words is in line with a typical average sentence length. The histogram below also shows that the word counts are skewed to the right (more sentences are short than long).
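A sketch of the sentence-length histogram:

```python
# Words per sentence and the shape of the distribution
sentence_word_counts = df['Sentence'].str.split().str.len()
sentence_word_counts.hist(bins=50)
plt.xlabel('Words per sentence')
plt.ylabel('Number of sentences')
plt.show()
```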

4. Save the Cleaned Data

Finally, save the cleaned datasets as project assets for later re-use.
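A sketch of the save calls, using project_lib's save_data; groups_df is an assumed DataFrame built from list_of_groups to record which cluster IDs fall in each set:

```python
# groups_df is a hypothetical shape for the per-set cluster assignments
groups_df = pd.DataFrame({'set_id': range(len(list_of_groups)),
                          'label_ids': [list(g) for g in list_of_groups]})

# Save both cleaned datasets back to the project; overwrite replaces existing files
project.save_data('themes.csv', df.to_csv(index=False), overwrite=True)
project.save_data('groups_of_themes.csv', groups_df.to_csv(index=False), overwrite=True)
```

If successful, each call returns metadata like the output below: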

{'file_name': 'themes.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'ibmdebaterthematicclusteringofsen...',
 'asset_id': '...'}

and

{'file_name': 'groups_of_themes.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'ibmdebaterthematicclusteringofsen...',
 'asset_id': '...'}

Note: In order for this step to work, your project token (see the first cell of this notebook) must have the Editor role. This step overwrites any existing file with the same name.

Next steps

Continue with Part 2 - Model Development, which loads the cleaned data assets saved above and evaluates a K-Means clustering model.

Authors

This notebook was created by the Center for Open-Source Data & AI Technologies.


Copyright © 2021 IBM. This notebook and its source code are released under the terms of the MIT License.